What’s Cash?

Cash is a C++ embedded domain-specific library (EDSL) for hardware design and simulation. It uses template metaprogramming and macro-based reflection to extend the C++ language with hardware-specific constructs. Cash enables developers to describe and simulate their hardware designs in a single source program, leveraging the large ecosystem of C++ development tools and libraries.

Figure: The Cash framework overview.

Build Instructions

Dependencies

Cash requires a C++17 compiler to build and works best with clang 9 to leverage its custom plugin for code reflection.

Other dependencies include:

System Setup

Install Build Essentials:

$ sudo apt-get install build-essential git cmake zlib1g-dev

Install IVerilog:

$ sudo apt-get install iverilog

Install LLVM 9 (Ubuntu 18.04 and above):

$ sudo apt-get install clang-9 libclang-9-dev

Install LLVM 9 (Ubuntu 16.04):

$ wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key|sudo apt-key add -
$ sudo add-apt-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-9 main"
$ sudo apt-get update
$ sudo apt-get install clang-9 libclang-9-dev

Installation

To install Cash you must clone the repository and create a build directory:

$ git clone https://github.com/gtcasl/cash.git && cd cash
$ mkdir build && cd build

Then run cmake to generate the makefile and export the package information:

$ cmake ..

Build and install Cash on your system:

$ make -j`nproc` all
$ sudo make install

Test your build

$ make test

Alternative Installation using LIBJIT Compiler

Install LIBJIT dependencies:

$ sudo apt-get install libtool autoconf flex bison texinfo

Build and install LIBJIT:

$ git clone https://git.savannah.gnu.org/git/libjit.git  
$ pushd libjit
$ ./bootstrap
$ mkdir build
$ pushd build
$ ../configure --with-pic
$ make -j`nproc` all
$ sudo make install
$ popd
$ popd

Build and install Cash using the ‘JIT=LIBJIT’ configuration option:

$ mkdir build && cd build
$ cmake .. -DJIT=LIBJIT
$ make -j`nproc` all
$ sudo make install

QuickStart Example

This example implements a generic matrix-multiply hardware block using a systolic array of MAC units, as illustrated in the block diagram below.

systolic matmul

Create demo folder
```
$ mkdir demo
$ cd demo
```

Copy Makefile template

$ cp /path_to_project/scripts/Makefile .

Create a file ‘demo.cpp’ that contains the code listing below.

#include <cash/core.h>
#include <assert.h>
#include <iostream>

using namespace ch::core;

// Generic MAC module
template <uint I, uint O>
struct MAC {
  __io (
    __in  (ch_bool)   enable,
    __in  (ch_int<I>) a_in,
    __in  (ch_int<I>) b_in,
    __out (ch_int<I>) a_out,
    __out (ch_int<I>) b_out,
    __out (ch_int<O>) c_out
  );

  void describe() {
    auto sum = io.c_out + ch_mul<O>(io.a_in, io.b_in);
    io.a_out = ch_nextEn(io.a_in, io.enable, 0);
    io.b_out = ch_nextEn(io.b_in, io.enable, 0);
    io.c_out = ch_nextEn(sum, io.enable, 0);
  }
};

// Generic MatMul module
template <unsigned I, unsigned O, unsigned N, unsigned P, unsigned M>
struct MatMul {
  __io (
    __in  (ch_bool)                         valid_in,
    __in  (ch_vec<ch_int<I>, N>)            a_in,
    __in  (ch_vec<ch_int<I>, P>)            b_in,
    __out (ch_vec<ch_vec<ch_int<O>, P>, N>) c_out,
    __out (ch_bool)                         valid_out
  );

  void describe() {
    // systolic 2D array of MAC units
    ch_vec<ch_vec<ch_module<MAC<I, O>>, P>, N> macs;

    // a simple counter
    ch_uint<log2up(N+P+M)> ctr;
    ctr = ch_nextEn(ctr + 1, io.valid_in, 0);

    // MAC array connections
    for (unsigned r = 0; r < N; ++r) {
      auto p = ch_delayEn(io.a_in[r], io.valid_in, r, 0);
      for (unsigned c = 0; c < P; ++c) {
        auto q = ch_delayEn(io.b_in[c], io.valid_in, c, 0);
        macs[r][c].io.enable = io.valid_in;
        macs[r][c].io.a_in = c ? macs[r][c-1].io.a_out.as_int() : p;
        macs[r][c].io.b_in = r ? macs[r-1][c].io.b_out.as_int() : q;        
        io.c_out[r][c] = macs[r][c].io.c_out;
      }
    }

    // output valid?
    io.valid_out = ch_nextEn(ctr == N+P+M-1, io.valid_in, false);
  }
};

static constexpr int InBits = 8;
static constexpr int OutBits = 24;
static constexpr int N = 2;
static constexpr int P = 3;
static constexpr int M = 4;

int main() {

  // a is NxM, b is MxP, c is NxP
  int a[N][M] = { { 0, 1, 2, 3 }, { 4, 5, 6, 7 } };
  int b[M][P] = { { 0, 1, 2 }, { 3, 4, 5 }, { 6, 7, 8 }, { 9, 10, 11 } };
  int c[N][P] = { { 42, 48, 54 }, { 114, 136, 158 } };

  ch_device<MatMul<InBits, OutBits, N, P, M>> matmul;

  ch_tracer tracer(matmul);
  tracer.run([&](ch_tick t)->bool {
    matmul.io.valid_in = true;
    auto j = t / 2;
    for (size_t i = 0; i < N; ++i) {
      matmul.io.a_in[i] = (j < M) ? a[i][j] : 0;
    }
    for (size_t i = 0; i < P; ++i) {
      matmul.io.b_in[i] = (j < M) ? b[j][i] : 0;
    }
    return !matmul.io.valid_out;
  }, 2);

  std::cout << "result = " << matmul.io.c_out << std::endl;

  // Verify  
  for (size_t j = 0; j < N; ++j)  {
    for (size_t i = 0; i < P; ++i) {
      assert(c[j][i] == matmul.io.c_out[j][i]);
    }
  }

  ch_toVerilog("matmul.v", matmul);
  tracer.toVCD("matmul.vcd");

  return 0;
}

Build the program
```
 $ make
```
Run the program
```
 $ ./demo.out
```
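
The expected values hard-coded in the ‘c’ array of the listing can be cross-checked with a plain software matrix multiply. The sketch below does not use Cash at all; ‘reference_matmul’ is a hypothetical helper name introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Plain C++ reference model (no Cash types): c = a * b,
// where a is N x M, b is M x P, and c is N x P.
template <std::size_t N, std::size_t M, std::size_t P>
std::array<std::array<int, P>, N> reference_matmul(
    const std::array<std::array<int, M>, N>& a,
    const std::array<std::array<int, P>, M>& b) {
  std::array<std::array<int, P>, N> c{};  // zero-initialized accumulators
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < P; ++j)
      for (std::size_t k = 0; k < M; ++k)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}
```

Feeding it the ‘a’ and ‘b’ matrices from the demo should reproduce the six values that the demo asserts against the hardware output.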
Documentation

Module Class

The top-level description of a hardware block in Cash is a module, described using a C++ struct or class. A valid Cash module should define at least two properties:

An __io() public member describing the module's input and output interfaces.
A describe() public method providing the code description of the hardware logic.

Cash modules can be extended like any other C++ class using inheritance or polymorphism. Likewise, class methods or functions can also be defined to improve abstraction and code reuse.

Our QuickStart example above describes a ‘MatMul’ top module that consumes a 2D array of ‘MAC’ sub-modules. The ‘MatMul’ top module is instantiated in the host ‘main()’ routine using the ‘ch_device<T>’ transform. The ‘MAC’ sub-modules are instantiated inside the top module using the ‘ch_module<T>’ transform.

Data Types

| Category | Description |
|---|---|
| Primary Types | ch_bit, ch_int, ch_uint, ch_bool |
| Literal Types | binary, octal, decimal, hexadecimal |
| IO Types | __in, __out, __interface |
| Sequential Types | ch_reg, ch_mem |
| User-Defined Types | ch_vec, __enum, __struct, __union |
| Component Types | ch_device, ch_module, ch_udf |
| Extended Types | ch_fixed, ch_float, ch_complex |

The primary types are the main storage elements for computation in the language, with ch_bit<N> representing a collection of consecutive bits. The other primary types, boolean (ch_bool), unsigned integer (ch_uint<N>), and signed integer (ch_int<N>), derive from ch_bit<N>. Extended types are implemented in the hardware template library as extensions to the primary types. Data types are configurable via C++ templates to specify the bit width of the object.

Literal Types

Cash extends C++ built-in literals with binary, octal, and hexadecimal literals. The size of a literal can be specified explicitly or inferred automatically from its value. The following code snippet shows the declaration of three literals a, b, and c.

auto a = 1010_b4;    // 4-bit binary
auto b = 4040_o;     // octal with size auto deduced!
auto c = 1080_h128;  // 128-bit hexadecimal
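
How Cash implements these suffixes is not shown in this document. As a rough illustration of the underlying C++ mechanism, the sketch below uses a numeric literal operator template to parse binary digits at compile time. The ‘_bin’ suffix and the ‘parse_binary’ helper are hypothetical names, and the sketch covers only value parsing, not the bit-width part of a suffix such as ‘_b4’.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: fold a pack of '0'/'1' characters into an integer.
template <char... Digits>
constexpr std::uint64_t parse_binary() {
  static_assert(((Digits == '0' || Digits == '1') && ...),
                "binary literals may only contain 0 and 1");
  std::uint64_t value = 0;
  // Left-to-right fold: shift in one digit at a time.
  ((value = (value << 1) | static_cast<std::uint64_t>(Digits - '0')), ...);
  return value;
}

// Hypothetical suffix `_bin` (not a Cash suffix): the compiler passes the
// digits of the literal as template arguments.
template <char... Digits>
constexpr std::uint64_t operator""_bin() {
  return parse_binary<Digits...>();
}
```

With this operator in scope, ‘1010_bin’ evaluates to the integer ten at compile time, and a literal containing any digit other than 0 or 1 fails to compile.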

I/O Types

Input/Output types in Cash are implemented using type specifiers that assign a direction to incoming and outgoing signals. __in(T) defines an input signal of data type T. __out(T) defines an output signal of data type T. The port interface of a module is declared using the __io() construct, where all the inputs and outputs are defined. Inside the module’s describe() implementation, the port interface is accessed via the io public class member. The following example illustrates a simple use of I/O types for a generic full adder. The io interface holds three input ports, cin, lhs, and rhs, and two output ports, out and cout.

template <unsigned N>
struct Adder {
  __io (
    __in  (ch_uint1)   cin,
    __in  (ch_uint<N>) lhs,
    __in  (ch_uint<N>) rhs,
    __out (ch_uint<N>) out,
    __out (ch_uint1)   cout
  );

  void describe() {
    auto sum = ch_pad<1>(io.lhs) + io.rhs + io.cin;
    io.out  = ch_slice<N>(sum);
    io.cout = sum[N];
  }
};
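
To make the carry handling concrete, here is a plain C++ software model of the same arithmetic, written without Cash types. ‘AddResult’ and ‘add_with_carry’ are hypothetical names, and the model is limited to widths that fit in a 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Result of an N-bit add: the N-bit sum plus the carry-out bit.
struct AddResult {
  std::uint64_t out;
  bool cout;
};

// Software model of the Adder above for 1 <= N <= 63.
template <unsigned N>
AddResult add_with_carry(std::uint64_t lhs, std::uint64_t rhs, bool cin) {
  static_assert(N >= 1 && N <= 63, "model keeps N+1 bits in a uint64_t");
  constexpr std::uint64_t mask = (std::uint64_t{1} << N) - 1;
  // The sum is computed one bit wider than the operands (the role of
  // ch_pad<1> in the module) so that the carry is not lost.
  const std::uint64_t sum = (lhs & mask) + (rhs & mask) + (cin ? 1u : 0u);
  // Low N bits are the result; bit N is the carry-out.
  return {sum & mask, ((sum >> N) & 1u) != 0};
}
```

For a 4-bit adder, adding 9 and 8 with no carry-in yields out = 1 and cout = 1, since the true sum 17 does not fit in four bits.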

Sequential Types

Sequential objects in Cash are defined using the generic objects ch_reg<T> and ch_mem<T,N>, which declare register and memory objects respectively. The default clock and reset signals are declared implicitly by the compiler. By default, sequential objects are updated on the rising edge of the default clock and the reset is synchronous. Cash uses ’next-value’ semantics to specify the next state of a register object. The following listing shows a Cash implementation of a generic FIFO, configurable by providing the enclosed element data type T and depth N. Register variables ‘rp’ and ‘wp’ are assigned their next value via the ‘rp->next’ and ‘wp->next’ members, respectively.

template <typename T, unsigned N>
class Fifo {
public:
  __io (
    (enq_io<T>)             enq,
    (ch_flip_io<enq_io<T>>) deq
  );
  
  static constexpr unsigned A = log2ceil(N);  // address width
  
  void describe() {
    ch_mem<T, N> ram;
    ch_reg<ch_uint<A+1>> rp(0), wp(0);

    auto r = io.deq.ready && io.deq.valid;
    auto w = io.enq.valid && io.enq.ready;

    auto ra = ch_slice<A>(rp);
    auto wa = ch_slice<A>(wp);
    
    rp->next = ch_sel(r, rp + 1, rp);
    
    __if (w) {
      ram[wa]->next = io.enq.bits;
      wp->next = wp + 1;
    };

    io.deq.bits  = ram[ra];
    io.deq.valid = wp != rp;
    io.enq.ready = wa != ra||wp[A]==rp[A];
  }
};
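
The full/empty logic in the FIFO relies on pointers that are one bit wider than the memory address. The following plain C++ sketch models only that pointer scheme, assuming a power-of-two depth; ‘FifoPointers’ is a hypothetical name and the data storage is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Software model of the read/write pointer scheme for a FIFO whose depth
// is a power of two: depth = 1 << A, pointers are A+1 bits wide.
template <unsigned A>
struct FifoPointers {
  static_assert(A >= 1 && A < 31, "model stores pointers in 32 bits");
  static constexpr std::uint32_t kPtrMask  = (1u << (A + 1)) - 1;
  static constexpr std::uint32_t kAddrMask = (1u << A) - 1;

  std::uint32_t rp = 0;  // read pointer  (A+1 bits)
  std::uint32_t wp = 0;  // write pointer (A+1 bits)

  // Empty: both pointers are identical, including the wrap bit.
  bool empty() const { return wp == rp; }

  // Full: same address bits, but the wrap bits (bit A) differ.
  bool full() const {
    return ((wp ^ rp) & kAddrMask) == 0 && ((wp ^ rp) >> A) == 1u;
  }

  // Advance the write pointer unless the FIFO is full.
  bool push() {
    if (full()) return false;
    wp = (wp + 1) & kPtrMask;
    return true;
  }

  // Advance the read pointer unless the FIFO is empty.
  bool pop() {
    if (empty()) return false;
    rp = (rp + 1) & kPtrMask;
    return true;
  }
};
```

The extra bit is what lets the design distinguish a full FIFO from an empty one, since the address bits alone are equal in both cases.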

There are three preferred ways of using registers in Cash:

1) ch_next(obj, init); ch_nextEn(obj, enable, init);

ch_bool x;
auto y = ch_next(x, false);

2) ch_delay(obj, delay, init); ch_delayEn(obj, enable, delay, init);

ch_bool x;
auto y = ch_delay(x, 4, 0);  // 4-cycle shift register

3) ch_reg;

ch_reg<ch_bool> x(0);
x->next = x + 1;

Use option 1) as much as possible for simplicity if you need a one-cycle latch.

Use option 2) if you need to delay the signal for multiple cycles.

Use option 3) if you need a more complex logic for the register.
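
As a mental model for options 1) and 2), a delay of D cycles behaves like D registers chained together. The plain C++ sketch below models that behavior in software; ‘DelayLine’ is a hypothetical name, and holding every stage while the enable is low is an assumption about the enabled variants, not something this document states.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Software model of a D-cycle delay built from D chained registers.
template <typename T, std::size_t D>
class DelayLine {
  static_assert(D >= 1, "at least one stage");

 public:
  // Every stage starts at the reset value `init`.
  explicit DelayLine(T init) { stages_.fill(init); }

  // Value currently visible at the output (the oldest stage).
  T output() const { return stages_[D - 1]; }

  // One clock edge: shift every stage by one and capture `in`.
  void tick(T in, bool enable = true) {
    if (!enable) return;  // assumption: all stages hold when enable is low
    for (std::size_t i = D - 1; i > 0; --i) stages_[i] = stages_[i - 1];
    stages_[0] = in;
  }

 private:
  std::array<T, D> stages_;
};
```

With D = 1 the model reduces to a single register, which corresponds to option 1); with D = 4 a value becomes visible at the output after four enabled clock edges.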

User-Defined Types

The Cash DSL supports aggregate types including enums, structs, and unions, defined using __enum(), __struct(), and __union () declarations, respectively. Static vectors are defined using ch_vec<T, N> declaration where T is the enclosed data type and N the number of entries in the container. Composition, inheritance, and templates are also supported on user-defined types to enable the full power of abstraction. The following listing shows a definition of an enum ‘FlitType’, a generic union ‘FlitData’, a struct ‘Flit’, and a vector ‘Flits’.

__enum (FlitType, (
  Invalid, 
  Valid
));

template <unsigned N>
__union (FlitData, (
  (ch_int<N>) vi,
  (ch_float)  vf
));

template <unsigned N>
__struct (Flit, (
  (FlitType)    type,
  (FlitData<N>) data
));

template <unsigned N>
using Flits = ch_vec<Flit<N>, 16>;

Built-in Operators

| Name | Operator/Function | Data Types | Category |
|---|---|---|---|
| Equal | == | primary types | Equality |
| Not Equal | != | primary types | Equality |
| Less | < | signed/unsigned types | Relational |
| Less or Equal | <= | signed/unsigned types | Relational |
| Greater | > | signed/unsigned types | Relational |
| Greater or Equal | >= | signed/unsigned types | Relational |
| Not | ! | primary types | Logical |
| And | && | primary types | Logical |
| Or | \|\| | primary types | Logical |
| Inverse | ~, ch_inv | primary types | Binary |
| And | &, ch_and | primary types | Binary |
| Or | \|, ch_or | primary types | Binary |
| Xor | ^, ch_xor | primary types | Binary |
| Reduce And | ch_andr | primary types | Reduce |
| Reduce Or | ch_orr | primary types | Reduce |
| Reduce Xor | ch_xorr | primary types | Reduce |
| Shift Left | <<, ch_shl | primary types | Shift |
| Shift Right | >>, ch_shr | primary types | Shift |
| Rotate Left | ch_rotl | primary types | Rotate |
| Rotate Right | ch_rotr | primary types | Rotate |
| Negation | -, ch_neg | signed/unsigned types | Arithmetic |
| Addition | +, ch_add | signed/unsigned types | Arithmetic |
| Subtraction | -, ch_sub | signed/unsigned types | Arithmetic |
| Multiplication | *, ch_mul | signed/unsigned types | Arithmetic |
| Division | /, ch_div | signed/unsigned types | Arithmetic |
| Modulus | %, ch_mod | signed/unsigned types | Arithmetic |
| Bit Select | [] | primary types | Subscript |
| Slicing | ch_slice | primary types | Subscript |
| Ternary | ch_sel | primary types | Conditionals |
| Multi-Selection | ch_case | primary types | Conditionals |
| Minimum | ch_min | primary types | Conditionals |
| Maximum | ch_max | primary types | Conditionals |
| Padding | ch_pad | primary types | Resizing |
| Resizing | ch_resize | all types | Resizing |
| Concatenation | ch_cat | all types | Resizing |
| Replication | ch_dup | all types | Resizing |
| Bit Shuffling | ch_shuffle | all types | Permutations |
| Reinterpret Cast | ch_as | all types | Cast |
| Register Cast | as_reg | all types | Cast |
| Clone | ch_clone | all types | References |
| Reference | ch_ref | all types | References |
| Slice Reference | ch_sliceref | all types | References |
| Aligned Slice Reference | ch_asliceref | all types | References |
| Group Assignment | ch_tie | all types | References |
| Map | ch_map | all types | Higher-Order |
| Fold | ch_fold | all types | Higher-Order |
| Zip | ch_zip | all types | Higher-Order |
| Scan | ch_scan | all types | Higher-Order |
| Single Latch | ch_next | all types | Buffers |
| Single Latch w/ enable | ch_nextEn | all types | Buffers |
| Delay Buffer | ch_delay | all types | Buffers |
| Delay Buffer w/ enable | ch_delayEn | all types | Buffers |
| Current Clock | ch_clock | all types | Clock Domain |
| Current Reset | ch_reset | all types | Clock Domain |
| Push Clock | ch_pushcd | all types | Clock Domain |
| Pop Clock | ch_popcd | all types | Clock Domain |
| Clock Region | ch_cd | all types | Clock Domain |
| Print | ch_print | all types | Debugging |
| Print NewLine | ch_println | all types | Debugging |
| Assertion | ch_assert | all types | Debugging |
| Signal Tapping | ch_tap | all types | Debugging |
| Current Time | ch_now | all types | Debugging |

Cash implements combinational circuits via C++ operators. Operators are natively supported on the primary and extended data types, but they are also accessible via inheritance on derived I/O and sequential types. The above table presents a classification of most of the operators defined in the DSL. When the bit widths of the source operands do not match, the DSL zero-extends or sign-extends them depending on their signedness. The output bit width is inferred automatically from the operands’ sizes and the type of the operation. The DSL also provides a function-based API for combinational circuits, to supplement existing operators or to add support for operators that are not natively available in C++, such as rotation or bit slicing.
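
Rotation and bit slicing have no built-in C++ operator, which is why they appear only as functions in the table. The plain C++ sketch below shows what the two operations compute on an N-bit value; ‘rotl_bits’ and ‘slice_bits’ are hypothetical names and do not reflect the signatures of ch_rotl or ch_slice.

```cpp
#include <cassert>
#include <cstdint>

// Rotate the low N bits of `x` left by `n` positions (1 <= N <= 63).
template <unsigned N>
constexpr std::uint64_t rotl_bits(std::uint64_t x, unsigned n) {
  static_assert(N >= 1 && N <= 63, "model limited to 63 bits");
  constexpr std::uint64_t mask = (std::uint64_t{1} << N) - 1;
  x &= mask;
  n %= N;  // rotating by a multiple of N is the identity
  return n == 0 ? x : ((x << n) | (x >> (N - n))) & mask;
}

// Extract W bits of `x` starting at bit position `lsb` (lsb < 64).
template <unsigned W>
constexpr std::uint64_t slice_bits(std::uint64_t x, unsigned lsb) {
  static_assert(W >= 1 && W <= 63, "model limited to 63 bits");
  return (x >> lsb) & ((std::uint64_t{1} << W) - 1);
}
```

For example, rotating the 4-bit value 1001 left by one position gives 0011, and slicing four bits out of 0xABCD starting at bit 4 gives 0xC.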

Cast Operators

To cast a variable from one type to another, static casts are supported on all primary and extended types using the native C++ static_cast operator. To perform a reinterpret cast, primary and extended types implement a generic method as<U>() that reinterprets the bits of a variable as a new type U. The following code snippet illustrates the various uses of the cast operators.

ch_int4 obj1 = 0x1;
auto obj2 = static_cast<ch_int8>(obj1); // static cast
auto obj3 = obj1.as<ch_uint4>(); // reinterpret cast
auto obj4 = obj1.as_uint(); // short form of previous line

Instance Operators

Data type assignments in Cash are by reference, as in Java, which expands support for a wider range of design patterns that exploit object reassignment. The DSL provides two utility operators, clone() and ref(), to copy a variable or create a pointer to it, respectively, during assignments. The following code snippet illustrates various uses of the instance operators.

ch_uint4 a = 0x0, b = a, c = a.clone(), d = a.ref();
a = 0x1; // only b and d are modified
b = 0x2; // only b is modified
c = 0x3; // only c is modified
d = 0x4; // a is also modified

Control Flow

There are two types of control flow support in Cash: static control flow and dynamic control flow.

Static Control Flow

Static control flows are control flow operations that can be constructed at compile time. Cash implements static control flow using combinational MUX circuits. The DSL provides the utility functions ch_sel() and ch_case() for describing static control flow as dataflow operators. The following code snippet shows some sample usages of the static control flow operators.

ch_int4 a = 0x1, b = 0x2, c = 0x3;

// x = (a == 0) ? b : ((a == 1) ? c : 0);
auto x1 = ch_sel(a == 0, b, ch_sel(a == 1, c, 0));
auto x2 = ch_sel(a == 0, b)(a == 1, c)(0); // short form
auto x3 = ch_case(a, 0, b)(1, c)(0);       // key-value form
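
The chained short form can look unusual at first. The plain C++ sketch below shows one way such a call chain can be built with an overloaded call operator; ‘SelectChain’ and ‘sel’ are hypothetical names and this is not how Cash implements ch_sel. Unlike a hardware MUX, which evaluates all arms in parallel, this toy simply picks the first arm whose condition is true.

```cpp
#include <cassert>

// Toy model of the chained form sel(c1, v1)(c2, v2)(default_value).
// The first condition that is true wins, like a priority mux.
template <typename T>
class SelectChain {
 public:
  SelectChain(bool cond, T value) : matched_(cond), value_(value) {}

  // Add another (condition, value) arm; ignored once an arm has matched.
  SelectChain& operator()(bool cond, T value) {
    if (!matched_ && cond) {
      matched_ = true;
      value_ = value;
    }
    return *this;
  }

  // Terminate the chain with the default value.
  T operator()(T fallback) const { return matched_ ? value_ : fallback; }

 private:
  bool matched_;
  T value_;
};

// Entry point of the chain: supplies the first (condition, value) arm.
template <typename T>
SelectChain<T> sel(bool cond, T value) {
  return SelectChain<T>(cond, value);
}
```

The two-argument call operator extends the chain and the one-argument call operator closes it, which mirrors the shape of the x2 expression above.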

The DSL also extends the C++ control flow statements with the __if, __elif, and __else attributes to represent more complex static control flow blocks. This feature enables nested control blocks as well as local variables to be used (see the listing in the next section).

Dynamic Control Flow

Dynamic control flow in hardware is implemented using finite state machines (FSMs). FSMs are implemented in Cash using enumeration types and registers for state transitions. The following code snippet implements the body of an FSM with three states: ‘State::idle’, ‘State::run’, and ‘State::done’.

__enum (State, (idle, run, done));

void describe() {
  ch_reg<State> state(State::idle);
  __switch (state)
  __case (State::idle) {
    __if (io.valid) {
      __if (io.count == 0) {
        state->next = State::done;
      } __else {
        state->next = State::run;
      };
    };
  }
  __case (State::run) {
    state->next = State::done;
  }
  __case (State::done) {
    state->next = State::idle;
  };
}
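
The transitions of this FSM can be checked in software with a plain next-state function. The sketch below does not use Cash; ‘next_state’ is a hypothetical name, ‘valid’ and ‘count’ stand in for io.valid and io.count, and holding the current state when no branch assigns a next value is an assumption about register behavior.

```cpp
#include <cassert>

enum class State { idle, run, done };

// Next-state function of the FSM above.
constexpr State next_state(State state, bool valid, unsigned count) {
  switch (state) {
    case State::idle:
      if (!valid) return State::idle;  // assumption: hold when not valid
      return count == 0 ? State::done : State::run;
    case State::run:
      return State::done;
    case State::done:
      return State::idle;
  }
  return state;  // not reached for valid enumerators
}
```

Stepping this function once per clock cycle reproduces the sequence idle, run, done, idle for a valid request with a non-zero count.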

Interfaces

The Cash framework supports port interface structures for declaring an interface as a type outside the module. This feature allows interfaces to be defined once and shared between hardware modules inside a project. Interfaces can be nested as members of another interface. It is also possible to use inheritance with interfaces. You declare an interface using the __interface() construct, inside which you place all your I/O ports.

Binding interfaces

One of the advantages of using interfaces is automatic binding, or bulk connection: to connect two interfaces you don’t have to explicitly connect each port, but simply bind the interfaces directly and let the compiler infer the correct connection between the nested ports. To bind two interfaces, you use the C++ call operator().

The following example defines three interfaces: link_io, plink_io, which derives from link_io, and filter_io, which uses plink_io as a nested member. The binding of these interfaces is illustrated inside the FilterBlock module when connecting the Filter sub-module instances f1_ and f2_.

template <typename T>
__interface (link_io, (
  __out (T) data,
  __out (ch_bool) valid
));

template <typename T>
__interface (plink_io, link_io<T>, (  // using inheritance
  __out (ch_bool) parity
));

template <typename T>
__interface (filter_io, (
  (plink_io<T>) x,              // nesting interfaces    
  (ch_flip_io<plink_io<T>>) y   // using flipped interfaces
));

template <typename T>
struct Filter {
  filter_io<T> io;
  void describe() {
    auto tmp = (ch_pad<1>(io.x.data) << 1)
              | ch_pad<1>(io.x.parity);
    io.y.data   = ch_delay(ch_slice<T>(tmp), 1, 0);
    io.y.parity = ch_delay(io.x.data[ch_width_v<T>-1], 1, 0);
    io.y.valid  = ch_delay(io.x.valid, 1, 0);
  }
};

template <typename T>
struct FilterBlock {
  filter_io<T> io;
  void describe() {
    f1_.io.x(io.x);        // binding interfaces
    f1_.io.y(f2_.io.x);
    f2_.io.y(io.y);    
  }
  ch_module<Filter<T>> f1_, f2_;
};

Clock Domains

The Cash DSL supports defining custom clock and reset signals via clock domains. The DSL provides a stack-based interface for modifying the current clock domain. This is done using the ch_pushcd(clock, reset, posedge) and ch_popcd() built-in functions. The following example illustrates the use of clock domains with two user-defined clocks and a user-defined reset signal. The generated Verilog program and VCD trace are also shown.

#include <cash/core.h>
#include <assert.h>
#include <iostream>

using namespace ch::core;

#define P_CLK1  1
#define P_CLK2  2
#define P_START (std::max(P_CLK1, P_CLK2) * 2)
#define P_STOP  (P_START * 5)

template <unsigned N>
struct MyModule {
  __io (
    __in  (ch_bool)    clk1,
    __in  (ch_bool)    clk2,
    __in  (ch_bool)    reset,
    __in  (ch_uint<N>) din,
    __out (ch_uint<N>) dout
  );

  void describe() {
    // posedge clk1, negedge reset
    ch_pushcd(io.clk1, ~io.reset, true); 
    ch_reg<ch_uint<N>> x(io.din);
    ch_popcd();

     // negedge clk2, negedge reset 
    ch_pushcd(io.clk2, ~io.reset, false); 
    ch_reg<ch_uint<N>> y(io.din);
    ch_popcd();

    // logic
    x->next = x + 1;
    y->next = y + 1;
    io.dout = x + y;    

    // add local variables to debug trace
    __tap(x); 
    __tap(y);
  }
};

int main() {
  ch_device<MyModule<8>> my_module;

  ch_tracer tracer(my_module);

  tracer.run([&](ch_tick t)->bool {
    switch (t) {
    case 0:
      my_module.io.clk1  = 0;
      my_module.io.clk2  = 1;
      my_module.io.reset = 0;
      my_module.io.din   = 7;
      break;
    default:      
      if (0 == (t % P_CLK1)) my_module.io.clk1  = !my_module.io.clk1;
      if (0 == (t % P_CLK2)) my_module.io.clk2  = !my_module.io.clk2;
      if (t >= P_START)      my_module.io.reset = 1;
      break;
    }
    return (t <= P_STOP);
  });

  std::cout << "result = " << my_module.io.dout << std::endl;

  ch_toVerilog("my_module.v", my_module);
  tracer.toVCD("my_module.vcd");

  return 0;
}

Generated Verilog

module MyModule(
  input wire io_clk1,
  input wire io_clk2,
  input wire io_reset,
  input wire[7:0] io_din,
  output wire[7:0] io_dout
);
  reg[7:0] x_11, y_16;
  wire _inv_7;
  wire[7:0] _add_19, _add_21, _add_22, x, y;

  always @ (posedge io_clk1) begin
    if (_inv_7)
      x_11 <= io_din;
    else
      x_11 <= _add_19;
  end
  always @ (negedge io_clk2) begin
    if (_inv_7)
      y_16 <= io_din;
    else
      y_16 <= _add_21;
  end
  assign _inv_7 = ~io_reset;
  assign _add_19 = x_11 + 8'h1;
  assign _add_21 = y_16 + 8'h1;
  assign _add_22 = x_11 + y_16;
  assign x = x_11;
  assign y = y_16;

  assign io_dout = _add_22;

endmodule
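
To see how the two clock domains interact, the generated Verilog can be mirrored by a small edge-triggered software model. The plain C++ sketch below is such a model for N = 8; ‘MyModuleModel’ is a hypothetical name, and the model assumes inputs are sampled once per call to eval().

```cpp
#include <cassert>
#include <cstdint>

// Edge-triggered software model of the generated MyModule Verilog (N = 8).
struct MyModuleModel {
  std::uint8_t x = 0, y = 0;  // registers x_11 and y_16
  bool prev_clk1 = false;     // clock levels seen on the previous call
  bool prev_clk2 = false;

  // Sample the inputs once; update each register on its own clock edge.
  void eval(bool clk1, bool clk2, bool reset, std::uint8_t din) {
    const bool load = !reset;  // _inv_7 = ~io_reset
    if (!prev_clk1 && clk1)    // posedge io_clk1
      x = load ? din : static_cast<std::uint8_t>(x + 1);
    if (prev_clk2 && !clk2)    // negedge io_clk2
      y = load ? din : static_cast<std::uint8_t>(y + 1);
    prev_clk1 = clk1;
    prev_clk2 = clk2;
  }

  // io_dout = x_11 + y_16, truncated to 8 bits.
  std::uint8_t dout() const { return static_cast<std::uint8_t>(x + y); }
};
```

While the reset input is low both registers reload din on their respective edges; once it goes high, x counts rising edges of clk1 and y counts falling edges of clk2.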

Generated VCD trace

Figure: generated VCD trace.

User-Defined Function

The Cash extension API is mainly driven via User-Defined Functions (UDF). This interface allows programmers to extend the base API functionalities by defining their own functions to integrate with the rest of the framework. This facility is particularly important for two scenarios:

When prototyping new hardware where only the cycle-level or functional modeling of sub-components is of interest.
When importing existing IPs given their functional implementation written in pure C++, in other frameworks like SystemC, or even as an existing HDL component written in Verilog.

The DSL extension namespace implements two generic interfaces, ch_udf_comb<T> and ch_udf_seq<T>, for instantiating combinational or sequential user-defined functions respectively. A user-defined function is implemented using a struct or class, similar to how modules are described in Cash. It should provide an eval() method instead of describe(), through which the user implements the desired combinational or functional model of the component. Functional modeling is done using the combinational ch_udf_comb<T> interface. Cycle-level modeling is done using the sequential ch_udf_seq<T> interface. Internally, the Cash simulator calls into the specified extension based on its execution model.

Functional-Level Modeling

The listing below is a simple example that implements a functional design for an integer division extension ‘MyDiv’ using user-defined functions. The I/O ports ‘lhs’, ‘rhs’, and ‘dst’ are system space ports directly accessible by the host application. The eval() method implements the functional model for integer division using C++ directly. The example also includes an ALU test module to illustrate how to instantiate and use the user-defined function inside a Cash module using the ch_udf_comb<T> interface.

struct MyDiv {
  __io (
    __in  (ch_int32) lhs,
    __in  (ch_int32) rhs,
    __out (ch_int32) dst
  );
  
  void eval() {
    io.dst = io.lhs / io.rhs;
  }

  void from_verilog(std::ostream& o) {
    o << "assign $io.dst = $io.lhs / $io.rhs;";
  }
};

struct ALU {
  __io (
    __in  (ch_int32) a,
    __in  (ch_int32) b,
    __out (ch_int32) c
  );
  
  void describe() {
    ch_udf_comb<MyDiv> div;
    div.io.lhs = io.a; 
    div.io.rhs = io.b;
    io.c = div.io.dst;
   }
};

Cycle-Level Modeling

The listing below is a simple example that implements a cycle-level design of a custom AES encryption accelerator via user-defined functions. The extension internally uses an existing cycle-level AES simulator which implements a tick() method for advancing its internal state. The ‘ch_udf_seq’ Cash interface is used in this case to tell the compiler that this extension supports sequential execution and should be invoked by the simulator on a per-clock-cycle basis.

struct AES {
  __io (
    __in  (ch_int128) plaintext,
    __in  (ch_int128) key,
    __out (ch_int128) ciphertext
  );
  
  void eval() {
    io.ciphertext = sim_.output();
    sim_.input(io.plaintext);
    sim_.key(io.key);
    sim_.tick(); // advance clock
  }
  
  AES_CAS_Simulator sim_;
};

struct SoC {
  __io (
    __in  (ch_int128) a,
    __in  (ch_int128) b,
    __out (ch_int128) c
  );
  
  void describe() {
    ch_udf_seq<AES> aes;
    aes.io.plaintext = io.a; 
    aes.io.key = io.b;
    io.c = aes.io.ciphertext;
   }
};

Verilog IP Import

User-defined functions also allow existing Verilog code to be provided as part of the extension description. This is done via the from_verilog() method, which the user implements to provide the code for the Verilog components. In the integer division listing above, the from_verilog() method demonstrates how custom Verilog code can be included to perform the same division. It is also possible to simply provide the path to a Verilog program file and have the framework load it directly.

When an eval() method is also provided, the Cash compiler invokes the provided function during simulation and uses the Verilog code only during codegen, merging it with the rest of the generated Verilog program. When no eval() method is provided, Cash automatically generates the simulation stub to execute the Verilog code. The compiler internally uses Verilog VPI to communicate with the external Verilog modules using any user-provided Verilog simulator.

Hardware Template Library

The Cash hardware template library (HTL) is a repository of generic reusable components for constructing hardware blocks in a standardized and efficient manner. The HTL currently includes hardware queues, arbiters, crossbars, counters, encoders, decoders, pipe registers, muxes, and fixed-point, floating-point, and complex number types. The HTL objects are defined under the ‘ch::htl’ namespace and are added to the project by including their header file. Our MatMul QuickStart example illustrates the use of the ch_counter object from the HTL.

RTL Simulation

The Cash DSL exposes an interface to the high-speed simulator via the ch_simulator object to give developers fine-grained control over the simulation execution. The interface implements three relevant functions:

eval(): for fine-grained invocations at every time tick
step(): for cycle-level invocations at clock edges
run(): for multi-cycle system-level invocations

There is also a tracer object ch_tracer which extends from ch_simulator to provide tracing capabilities to the simulator. The ch_tracer object implements the following functions to generate various traces for debugging:

toText(file): creates a text file with trace information
toVCD(file): creates a VCD trace file
toVerilog(file): creates a Verilog testbench that simulates the execution trace
toVerilator(file): creates a Verilator testbench that simulates the execution trace
toSystemC(file): creates a SystemC testbench that simulates the execution trace

There are three ways of invoking the Cash simulator:

1) Single-run mode: when the input values do not need to change during the simulation.

int main() {
  ch_device<MyModule<ch_bit2, 2>> my_device;
  ch_simulator simulator(my_device);
  my_device.io.din  = 1;  // assign inputs
  my_device.io.push = 1;
  simulator.run(20);  // run the simulation for 20 cycles
  assert(my_device.io.full == true);  // check outputs
  return 0;
}

2) Callback mode: when the input values have to change during the simulation or when you need to check your output at a specific time. You use a lambda callback function to intercept the simulator and apply your changes before every simulation step.

int main() {
  ch_device<MyModule<ch_bit2, 2>> my_device;
  ch_simulator simulator(my_device);
  simulator.run([&](ch_tick t)->bool {
    switch (t) {
    case 0:
      my_device.io.din  = 1;  // assign inputs
      my_device.io.push = 1;
      break;      
    case 2:
      assert(my_device.io.full == false);  // check outputs
      my_device.io.din  = 2;
      my_device.io.push = 1;
      break;
    case 4:
      assert(my_device.io.full == true);  // check outputs
      break;
    }
    return (t <= 4);
  });
  return 0;
}

3) Stepping mode: when the input values have to change during the simulation or when you need to check your output at a specific time. You can directly invoke the simulation steps.

int main() {
  ch_device<MyModule<ch_bit2, 2>> my_device;
  ch_simulator simulator(my_device);
  simulator.reset();                  // invoke clock reset sequence
  my_device.io.din  = 1;              // assign inputs
  my_device.io.push = 1;
  simulator.step(2);                  // advance one cycle (2 ticks)
  my_device.io.din  = 2;
  my_device.io.push = 1;
  simulator.step(2);                  // advance one cycle (2 ticks)
  assert(my_device.io.full == true);  // check outputs
  return 0;
}
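
The callback mode boils down to a loop that consults a user function before every tick. The plain C++ sketch below models that control flow without Cash; ‘run_with_callback’ and its ‘advance’ parameter are hypothetical names, and stopping immediately when the callback returns false is an assumption about the ordering, which this document does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

using Tick = std::uint64_t;

// Toy driver that mimics the callback style: the callback runs before every
// tick and returns false to stop. Returns the number of ticks executed.
inline Tick run_with_callback(const std::function<bool(Tick)>& callback,
                              const std::function<void()>& advance,
                              Tick max_ticks) {
  Tick t = 0;
  while (t < max_ticks && callback(t)) {
    advance();  // the design under test advances by one tick
    ++t;
  }
  return t;
}
```

With a callback that returns (t <= 4), as in the example above, this toy loop advances the design five times, for ticks 0 through 4, and then stops.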

Hardware Diagnostics

The Cash DSL provides the following diagnostic APIs to verify the hardware design at runtime:

ch_assert(cond, msg): inserts an assertion into the hardware to check a specific condition.
ch_tap(obj): inserts trace monitor on a specific variable.
ch_print(fmt, args): prints formatted text to the console output.
ch_cout(): prints formatted text to the console output using a stream interface.

Cash projects can also leverage existing C++ unit test frameworks like Google Test, Boost Test, or Catch for large-scale projects.

Architecture Simulators Integration

The step() function of the simulator object is the preferred choice when simulating a Cash model inside CAS-based architecture simulators like GEM5 and SST. In GEM5, this can be done via the processEvent() method of SimObject objects. In SST and Manifold, it is done by handling registered clock event callbacks on Component objects. The following listing illustrates the implementation of a simple SST component simulating a Cash model.

struct MyComponent : public Component {
  MyComponent(...) {
    ch_device<Adder<4>> my_adder;
    my_sim = std::make_shared<ch_simulator>(my_adder);
    auto clk = get_current_clock();
    registerClock(clk, &MyComponent::tick);
  }

  void tick() {
    my_sim->step(); // invoke Cash simulator
  }
  
  std::shared_ptr<ch_simulator> my_sim;
};

HDL Codegen

The Cash DSL currently provides two main functions for exporting HDL code:

ch_toVerilog(file, device): exports the provided Cash device to Verilog HDL.
ch_toFIRRTL(file, device): exports the provided Cash device to FIRRTL.

Our MatMul QuickStart example illustrates the use of the ch_toVerilog function to generate Verilog HDL.

High-Level Synthesis

High-Level Synthesis (HLS) tools like Intel Quartus include a compiler that can convert OpenCL programs to RTL. The compiler provides an extension API for referencing custom RTL components inside OpenCL kernels and using them as external libraries during the synthesis flow. This interface can be used to optimize OpenCL kernels by providing fine-tuned RTL implementations of some components. Another application is profiling existing RTL models, since Quartus already provides the application software and system components to support the FPGA. RTL libraries in Quartus interact with the OpenCL kernel via the Avalon Bus Interface.

The Cash library provides a generic implementation of the Avalon interface and utility functions to offer a productive development environment for designing OpenCL RTL libraries. The library also implements an Avalon bus extension for the Cash simulator that enables the simulation of OpenCL RTL libraries within the Cash environment. The same extension also supports simulation of RTL libraries inside the OpenCL emulator.

The Cash HTL implements the following objects for HLS integration:

avm_reader<T, N>: the Avalon interface for reading from memory.
avm_writer<T, N>: the Avalon interface for writing to memory.
avm_slave_driver<Cfg>: the Avalon slave interface for simulating memory transfer.
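As a rough mental model of the reader and writer roles, the plain C++ sketch below streams elements from a flat memory into a FIFO and drains a FIFO back into memory. It is a software analogy only, not Cash code and not how the library implements these objects. The names readStream and writeStream are hypothetical, and bus details such as bursts, latency and back-pressure are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Reader analogy: copy up to `count` elements starting at index `base`
// from memory into a FIFO that a downstream consumer can dequeue from.
template <typename T>
std::queue<T> readStream(const std::vector<T>& memory,
                         std::size_t base,
                         std::size_t count) {
  std::queue<T> fifo;
  for (std::size_t i = 0; i < count && base + i < memory.size(); ++i) {
    fifo.push(memory[base + i]);
  }
  return fifo;
}

// Writer analogy: drain the FIFO into memory starting at index `base`.
// Stops when the FIFO is empty or memory is full; returns the number of
// elements written.
template <typename T>
std::size_t writeStream(std::queue<T>& fifo,
                        std::vector<T>& memory,
                        std::size_t base) {
  std::size_t written = 0;
  while (!fifo.empty() && base + written < memory.size()) {
    memory[base + written] = fifo.front();
    fifo.pop();
    ++written;
  }
  assert(base + written <= memory.size() || written == 0);
  return written;
}
```

In this analogy a processing core would sit between the two FIFOs, dequeuing from the reader side and enqueuing on the writer side.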

The following example illustrates a Sobel filter OpenCL wrapper interface, ‘sobel_ocl’, that uses the avm_reader and avm_writer interfaces. You can find an implementation of the Sobel filter in the ‘examples’ folder of the Cash source repository. The simulation code in the ‘main()’ function shows how the avm_slave_driver interface is used to simulate the Avalon bus interface.

#include <sobel.h>
#include <cash/eda/altera/avalon.h>

using namespace eda::altera::avalon;

template <typename T>
class sobel_ocl {
public:
  __io (
    __in (ch_uint<avm_v0::AddrW>) dst,
    __in (ch_uint<avm_v0::AddrW>) src,
    __in (ch_uint32)              count,
    (avalon_st_io)                avs,
    (avalon_mm_io<avm_v0>)        avm_dst,
    (avalon_mm_io<avm_v0>)        avm_src
  );

  __enum (ctrl_state, (idle, running, done));

  sobel_ocl(uint32_t width, uint32_t height, uint32_t pipelen)
    : core_(width, height, pipelen)
  {}

  void describe() {
    ch_reg<ctrl_state> state(ctrl_state::idle);

    __switch (state)
    __case (ctrl_state::idle) {
      __if (io.avs.valid_in) {
        state->next = ctrl_state::running;
      };
    }
    __case (ctrl_state::running) {
      __if (core_.io.done) {
        state->next = ctrl_state::done;
      };
    }
    __case (ctrl_state::done) {
      __if (!avm_writer_.io.busy && io.avs.ready_in) {
        state->next = ctrl_state::idle;
      };
    };

    auto start = io.avs.valid_in && io.avs.ready_out;
    io.avs.ready_out = (state == ctrl_state::idle);
    io.avs.valid_out = (state == ctrl_state::done) && !avm_writer_.io.busy;

    avm_reader_.io.base_addr = io.src;
    avm_reader_.io.start = start;
    avm_reader_.io.count = io.count;
    avm_reader_.io.avm(io.avm_src);
    avm_reader_.io.deq(core_.io.in);

    avm_writer_.io.base_addr = io.dst;
    avm_writer_.io.start = start;
    avm_writer_.io.done = (state == ctrl_state::done);
    avm_writer_.io.avm(io.avm_dst);
    avm_writer_.io.enq(core_.io.out);
  }

private:

  ch_module<sobel_core<T>> core_;
  ch_module<avm_reader<T>> avm_reader_;
  ch_module<avm_writer<T>> avm_writer_;
};

int main() {
  uint32_t width, height;
  std::vector<uint8_t> src_image;

  if (!readImage(src_image_file, &width, &height, &src_image))
    return -1;

  auto num_pixels = width * height;
  auto num_blocks = ceildiv(num_pixels, 64);
  src_image.resize(num_blocks * 64);

  auto dst_image_size = num_blocks * 64;

  std::vector<uint8_t> dst_image(dst_image_size, 0);

  ch_device<sobel_ocl<ch_uint8>> device(width, height, 4);
  
  // setup Avalon slave driver
  avm_slave_driver<avm_v0> avm_driver(2, 128, 84);
  avm_driver.bind(0, device.io.avm_src, src_image.data(), src_image.size());
  avm_driver.bind(1, device.io.avm_dst, dst_image.data(), dst_image.size());

  // run simulation
  ch_tracer tracer(device);
  device.io.avs.valid_in = false;
  device.io.avs.ready_in = false;

  auto ticks = tracer.run([&](ch_tick t)->bool {
    if (2 == t) {
      // start simulation
      device.io.dst = 0;
      device.io.src = 0;
      device.io.count = num_pixels;
      device.io.avs.valid_in = true;
      device.io.avs.ready_in = true;
    }
    // tick avm driver
    avm_driver.tick();

    // stop simulation when done
    return !device.io.avs.valid_out;
  }, 2);

  // flush pending requests
  avm_driver.flush();

  std::cout << "Simulation run time: " << std::dec << ticks/2 << " cycles" << std::endl;

  ch_toVerilog("sobel_ocl.v", device);
  tracer.toVCD("sobel_ocl.vcd");

  return 0;
}
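The control logic in describe() is a three-state handshake, and it can be easier to follow as ordinary software. The sketch below restates the next-state and output equations from the listing in plain C++, with no Cash dependency. CtrlState, CtrlInputs, nextState, readyOut, validOut and startPulse are hypothetical names chosen for this illustration. It models one evaluation per clock cycle and does not capture register timing or the datapath.

```cpp
#include <cassert>

// Mirrors the ctrl_state enum declared in sobel_ocl.
enum class CtrlState { idle, running, done };

// Signals sampled by the controller on each clock cycle.
struct CtrlInputs {
  bool valid_in;     // io.avs.valid_in
  bool ready_in;     // io.avs.ready_in
  bool core_done;    // core_.io.done
  bool writer_busy;  // avm_writer_.io.busy
};

// Next-state function following the __switch block in describe().
inline CtrlState nextState(CtrlState s, const CtrlInputs& in) {
  switch (s) {
  case CtrlState::idle:
    return in.valid_in ? CtrlState::running : s;
  case CtrlState::running:
    return in.core_done ? CtrlState::done : s;
  case CtrlState::done:
    return (!in.writer_busy && in.ready_in) ? CtrlState::idle : s;
  }
  return s;
}

// io.avs.ready_out: the wrapper accepts a new request only when idle.
inline bool readyOut(CtrlState s) {
  return s == CtrlState::idle;
}

// io.avs.valid_out: asserted once the core is done and the writer has
// no pending work.
inline bool validOut(CtrlState s, const CtrlInputs& in) {
  return s == CtrlState::done && !in.writer_busy;
}

// The 'start' signal that kicks off the reader and the writer.
inline bool startPulse(CtrlState s, const CtrlInputs& in) {
  return in.valid_in && readyOut(s);
}
```

Because validOut is held low while the writer is busy, the test bench loop in main() keeps running until the last write has been issued, which is what its stop condition relies on.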

Examples

The Cash source repository includes the following examples.

adder: A simple binary adder module with carry in and out.
counter: A simple binary counter showcasing register use in Cash.
fastmul: A fast integer multiply module using read-only memory.
gcd: A GCD module showcasing control flow in Cash.
vending: A state machine example.
fifo: A basic First-in-First-out module.
aes: A generic AES encryption module.
sobel: A simple sobel image filter.
vectoradd: A simple Vector Add module.
fft: A pipelined Radix2 FFT processor.
sqrt: A simple Sqrt module showcasing UDF use in Cash.

You can execute any example manually using the following command:

  $ cd build/examples
  $ ../bin/adder

Test Suite

The Cash source repository includes a unit-test suite for validating the DSL, compiler, and codegen functionality. The test suite is currently integrated with the Travis continuous integration and Codecov code coverage services.

You can execute the unit tests manually using the following command:

  $ cd build/tests
  $ ../bin/testsuite

Contributions

Contributions to this codebase are welcome; please email me at blaise.tine@gmail.com.

License

Released under the BSD license; see LICENSE for details.